Studies of coevolution of amino acids within and between proteins have revealed two types of 
coevolving units: coevolving contacts, which are pairs of amino acids distant along the sequence but 
in contact in the three-dimensional structure, and sectors, which are larger groups of structurally 
connected amino acids that underlie the biochemical properties of proteins. By reconciling two 
approaches for analyzing correlations in multiple sequence alignments, we uncover a new class of 
coevolving units, called 'sectons'. Sectons provide a conceptual link between coevolving contacts 
and sectors. The methods and results that we present are general, and relevant beyond protein 
structures. This generality is illustrated with an analysis of the co-occurrence of orthologous genes 
in bacterial genomes. 
The structural and functional properties of proteins 
emerge from interactions between their amino acids. 
During evolution, these interactions constrain the sub- 
stitutions of amino acids that may happen. Sequences 
resulting from multiple independent evolutionary trajec- 
tories reflect these constraints, and therefore contain in- 
formation about the organization of interactions within 
proteins. Such sequences are now made available by DNA 
sequencing technology, which provides thousands of pro- 
tein sequences that have diverged independently and un- 
der similar selective pressures from a common ancestral 
sequence. 

These protein sequences are commonly collected into 
multi-sequence alignments on the basis of their sequence 
similarity. Alignments for over $10^4$ families of protein domains
are for instance available in the Pfam database [1].
Formally, an alignment is described by an $M \times L \times A$ binary
array $x_{si}^a$, where $x_{si}^a = 1$ indicates that sequence
$s \in \{1, \dots, M\}$ has amino acid $a \in \{1, \dots, A\}$ at position
$i \in \{1, \dots, L\}$, with $x_{si}^a = 0$ otherwise; some positions
contain gaps, inserted to ensure an optimal alignment
of sequences, which are represented by $x_{si}^a = 0$ for
every amino acid $a = 1, \dots, A$, where $A = 20$. Typical 
numbers are M ~ 10^-10'' for the number of sequences, 
L ~ 10^-10'^ for the length of the alignment. 
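
As an illustration of this encoding, the following minimal sketch builds the binary array from a toy alignment. The function name `one_hot_alignment`, the ordering of the amino acid alphabet and the toy sequences are assumptions made for the example; they are not taken from the original analysis.

```python
import numpy as np

# Assumed ordering of the A = 20 amino acids; any fixed ordering would do.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_alignment(sequences):
    """Return the M x L x A binary array x[s, i, a] of an alignment.

    Gaps (and any letter outside AMINO_ACIDS) are encoded as all zeros,
    i.e. x[s, i, a] = 0 for every amino acid a.
    """
    index = {letter: a for a, letter in enumerate(AMINO_ACIDS)}
    M, L = len(sequences), len(sequences[0])
    x = np.zeros((M, L, len(AMINO_ACIDS)), dtype=np.uint8)
    for s, seq in enumerate(sequences):
        for i, letter in enumerate(seq):
            a = index.get(letter)  # None for a gap
            if a is not None:
                x[s, i, a] = 1
    return x

# Hypothetical toy alignment: M = 3 sequences of length L = 4.
toy = ["AC-D", "ACWD", "GC-D"]
x = one_hot_alignment(toy)
```

With this convention the array has shape `(M, L, A)`, so that `x[s, i]` is the indicator vector of the amino acid of sequence `s` at position `i`.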

The pattern of amino acid interactions may be inferred 
from the statistical correlations between pairs of posi- 
tions in the alignment. Analyses of these correlations 
are complicated by several factors: (i) proteins are gath- 
ered in an alignment based on sequence similarity, with 
no guarantee to have been subject to common selective 
constraints; (ii) sequences are not sampled independently 
during evolution, but through a branching process, which 
introduces a sampling bias; (iii) the information content 
of the alignment, ^ ML log2 A ^ 10'''-10'' bits, is small 
compared to the number ^ A^L'^/2 ^ 10^-10* of con- 
tinuous parameters defining the correlations between ev- 
ery pair of amino acids, which implies a severe under- 
sampling; (iv) two positions may be correlated while not 
directly interacting, reflecting a fundamental difference 
between interactions and correlations. 

Standard statistical analyses assimilate the observed 
samples to an asymptotically large number of inde- 
pendently and identically distributed random variables. 
Points (i), (ii) and (iii) violate each of these assumptions, 
while point (iv) suggests that, even in absence of bias, 
further processing is required to infer interactions from 
correlations. 

Many approaches have been proposed to tackle these
challenges [2]. Recently, two methods have been developed,
each rooted in a different concept of statistical mechanics,
and each providing results of a different nature. In
an extension of an approach called Statistical Coupling
Analysis (SCA) [3], an application of concepts from random
matrix theory [4] to address (iii) has revealed collective
modes of coevolution named 'sectors' [5]. A protein
sector consists of ~15-30 positions that are connected in
the three-dimensional structure, and experiments indicate
that each sector independently controls a biochemical
property of the protein [5]. In a different approach
called Direct Coupling Analysis (DCA) [6], the problem
(iv) of inferring interactions from correlations was formulated
and solved as a problem of inverse statistical
mechanics, leading to the inference of a large number
of pairs of positions in contact in the three-dimensional
structure [7].

The two approaches, SCA and DCA, differ in their 
principles as well as in their results. In a comparison us- 
ing a common alignment, the contacts inferred by DCA 
seemed to bear no relation to the sectors identified by 
SCA [7]. We provide here a rationale for these apparently 
dissonant results. We show (1) how the two approaches 
expose two aspects of a common pattern of amino acid 
interactions; (2) how new units of coevolution can be 
defined, which underlie the contacts inferred by DCA. 
We name these elementary units 'sectons', and illustrate 
their relation to sectors with the trypsin family of en- 
zymes. 

Our arguments are general, and the notion of sectons 
relevant not only to other protein families, but also to 
datasets of a different nature. We demonstrate this by
applying the same methods to the co-occurrence of orthologous
genes in bacterial sequences, also known as their 
phylogenetic profile [S]. We show (3) how sectons can 
be identified at the scale of the genome, which define 
elementary units of co-functional genes. 

We begin with an analysis of coevolution in the trypsin 
family of protein sequences, using the alignment from 
Pfam [1], which contains nearly 15000 sequences. SCA 
and DCA both start from the same correlation matrix
$C_{ij}^{ab}$, which reports the coevolution of amino acid $a$ at
position $i$ with amino acid $b$ at position $j$. Prior to defining
this matrix, some steps must be taken to clean the
alignment from positions with excessive gaps and to mitigate
the effects of (i) and (ii) by weighting differentially
the contributions of the various sequences. These steps
are straightforward but essential, and may be common
to both approaches (all details are provided as Supplementary
Material [9]). The outcome is the definition of
$f_i^a$, the frequency of amino acid $a$ at position $i$, and $f_{ij}^{ab}$,
the joint frequency of $(a, b)$ at the pair of positions $(i, j)$.
These frequencies define the correlation matrix $C_{ij}^{ab}$ as

$$C_{ij}^{ab} = f_{ij}^{ab} - f_i^a f_j^b. \qquad (1)$$
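
The computation of the frequencies and of the correlation matrix can be sketched as follows. This is a minimal illustration, assuming a binary array `x` of shape `(M, L, A)` and optional sequence weights `w`; the function name `correlation_matrix` and the index order `C[i, a, j, b]` are choices made for the example.

```python
import numpy as np

def correlation_matrix(x, w=None):
    """Frequencies f[i, a] and correlations C[i, a, j, b] of an alignment.

    x : binary array of shape (M, L, A)
    w : optional sequence weights of shape (M,); uniform if omitted
    """
    M, L, A = x.shape
    w = np.ones(M) if w is None else np.asarray(w, dtype=float)
    w = w / w.sum()                       # weights normalized to sum to 1
    xf = x.reshape(M, L * A).astype(float)
    f1 = w @ xf                           # single-site frequencies
    f2 = (xf * w[:, None]).T @ xf         # pairwise joint frequencies
    C = f2 - np.outer(f1, f1)             # joint minus product of marginals
    return f1.reshape(L, A), C.reshape(L, A, L, A)
```

Flattening the pair of indices `(i, a)` into a single index gives the $(AL) \times (AL)$ matrix form of the correlations.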
SCA aims at identifying groups of positions under selection
for a common functional property, based on two
principles: the conservation of amino acids involved in
the function, and their correlations induced by cooperative
interactions. SCA takes a heuristic approach to
combine these two principles by weighting the correlations
$C_{ij}^{ab}$ with a measure of amino acid conservation $W_i^a$,

where $q^a = \sum_{i=1}^{L} f_i^a / L$ is the mean frequency of amino
acid $a$; thus, the more $f_i^a$ deviates from $q^a$ and approaches
1, the larger $W_i^a$ is. Taking $W_i^a W_j^b C_{ij}^{ab}$ defines
a conservation-weighted correlation matrix, which is reduced
to an $L \times L$ correlation matrix $\tilde{C}_{ij}$ between positions as
follows [3, 10]:

(3)

Following an approach first proposed to infer business
sectors from correlations between financial time series [4],
protein sectors are identified from the top eigenvectors
of this matrix [5]. This definition is facilitated
by rotating the top eigenvectors into maximally independent
components, using independent component analysis
(ICA) [11, 12]. Sectors can be defined from the top $k_{\rm top}$
components $V^{(k)}$ as $S_k = \{i : V_i^{(k)} > \epsilon\}$. Here, we take
$k_{\rm top} = 4$ and $\epsilon = 0.1$, but the results are insensitive to
these exact values; increasing $k_{\rm top}$ or decreasing $\epsilon$ in fact
provides complementary information [S]. The sectors are
represented with different colors in Fig. 1 (for simplicity,
these sectors do not include the positions $i$ with $V_i^{(k)} > \epsilon$
for multiple $k$ [9]).

As shown in Fig. 1, each sector forms a connected
group of positions on the three-dimensional structure, despite
not necessarily consisting of consecutive positions
along the sequence. Sectors have no sharp boundaries,
but are typically organized into an onion-like hierarchy,
with the core of sector $k$ consisting of positions $i$ with
largest $V_i^{(k)}$, and layers associated with decreasing values
of $V_i^{(k)}$ [5]. Three sectors were previously inferred for
the same protein family using an alignment about 10 times
smaller [SUS]: two of these sectors, respectively associated
with enzymatic activity and specificity, correspond to the
green and red sectors in Fig. 1; the third one, which had
the peculiarity of a disconnected core, and which correlated
experimentally with stability, is now partly spread
over two new sectors, whose functional role remains to
be characterized.

FIG. 1: Protein sectors in the trypsin family, as inferred from
the Pfam alignment PF00089 [1] - (A) Projections of the
positions $i$ along the vectors $V^{(k)}$ obtained by rotating by
ICA the top 4 eigenvectors of $\tilde{C}_{ij}$. Sector $k$ is defined by the
positions $i$ with $V_i^{(k)} > \epsilon$ and $V_i^{(l)} < \epsilon$ for $l \neq k$, with $\epsilon = 0.1$.
(B) Location of the sectors on a three-dimensional structure
of trypsin [13]. (C) Location of the sectors along the sequence
(cut in two for readability), with non-sector positions in white
(numbering system of bovine chymotrypsin).

In contrast to SCA, DCA aims at identifying structural
contacts between positions by inferring direct interactions
from indirect correlations. It proceeds by a
mapping to the problem of reconstructing the couplings
$J_{ij}^{ab}$ of an $(A+1)$-state Potts model given its correlations
$C_{ij}^{ab}$. Solving this problem exactly is computationally
prohibitive, but mean-field approximations provide a
range of alternatives [14]. The simplest of these approximations
consists in taking $J = -C^{-1}$, where $C$, given
by Eq. (1), is treated as an $(AL) \times (AL)$ matrix. As $C$
typically has rank $< M$, it is not invertible, but it can be
regularized to

$$\bar{C} = \lambda C + (1 - \lambda) Q, \qquad Q_{ij}^{ab} = q^a (\delta^{ab} - q^b)\, \delta_{ij}, \qquad (4)$$

where $Q$ represents a background expectation, so that we
can take $J = -\bar{C}^{-1}$ (with $\lambda = 1/2$ here as in Ref. [7]).
Regularization is not necessary for SCA, but substituting
$\bar{C}_{ij}^{ab}$ for $C_{ij}^{ab}$ in Eq. (3) does not alter the definition of
sectors [H].
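
The regularization of Eq. (4) and the mean-field inversion can be sketched as follows. This is an illustration under stated assumptions: `C` is stored with index order `C[i, a, j, b]`, the background frequencies `q` sum to less than 1 (the remainder being the frequency of gaps, which makes the background block positive definite), and the function name `mean_field_couplings` is hypothetical.

```python
import numpy as np

def mean_field_couplings(C, q, lam=0.5):
    """Regularize C as lam*C + (1-lam)*Q and return J = -inverse.

    C : correlations of shape (L, A, L, A)
    q : background amino acid frequencies of shape (A,), with sum(q) < 1
    """
    L, A = C.shape[0], C.shape[1]
    # Background block q^a (delta_ab - q^b), placed on the diagonal i = j.
    block = np.diag(q) - np.outer(q, q)
    Q = np.zeros((L, A, L, A))
    for i in range(L):
        Q[i, :, i, :] = block
    C_bar = lam * np.asarray(C, dtype=float) + (1.0 - lam) * Q
    # Treat the regularized matrix as (AL) x (AL) and invert it.
    J = -np.linalg.inv(C_bar.reshape(L * A, L * A))
    return J.reshape(L, A, L, A)
```

Since $C$ is positive semi-definite and $Q$ is positive definite under the assumption above, the regularized matrix is invertible for any $0 \le \lambda < 1$.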

The couplings $J_{ij}^{ab}$ define, for $i \neq j$, a model for the
distribution of amino acids at every pair of positions $ij$,

$$g_{ij}^{ab} = \exp\left(J_{ij}^{ab} + h_i^a + h_j^b + h_0\right), \qquad (5)$$

where $h_i^a$, $h_j^b$, $h_0$ are uniquely determined by requiring
that $\sum_b g_{ij}^{ab} = \bar{f}_i^a$ with $\bar{f}_i^a = \lambda f_i^a + (1 - \lambda) q^a$, and by
imposing an overall normalization. From $g_{ij}^{ab}$, a matrix
of 'direct information' $D_{ij}$ is defined by
As shown in Ref. [7], many of the pairs $ij$ with top values
of $D_{ij}$ are in contact in the three-dimensional structure,
to the extent that these contacts provide sufficient
constraints to infer the structure [15] (contacts are here
defined by a distance < 8 Å).

In this work, we follow Ref. [7] in defining $D_{ij}$ by
Eq. (6), but we truncate it before analyzing it by ICA
as we did for $\tilde{C}_{ij}$. Many of the top pairs in terms of $D_{ij}$
are indeed induced by the presence of stretches of gaps
and are therefore consecutive along the sequence; to focus
on non-trivial contacts, we substitute $D_{ij}$ with $\hat{D}_{ij}$,
where $\hat{D}_{ij} = D_{ij}$ if $|i - j| > \Delta$, and 0 otherwise. We take
here $\Delta = 5$, but other values give consistent results.
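
The truncation of the matrix of direct information along its diagonal can be written in a few lines. The sketch below assumes the matrix is available as an `L x L` NumPy array; the function name `truncate_band` is chosen for the example.

```python
import numpy as np

def truncate_band(D, delta=5):
    """Keep D[i, j] only for |i - j| > delta; set the rest to zero."""
    L = D.shape[0]
    i, j = np.indices((L, L))
    return np.where(np.abs(i - j) > delta, D, 0.0)
```

Setting `delta=0` removes only the diagonal, so that all pairs of distinct positions are retained.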

As for $\tilde{C}_{ij}$, the truncated matrix of direct information $\hat{D}_{ij}$ can be
analyzed by extracting its top eigenvectors and rotating them by ICA.
Remarkably, this leads to a large number (~100) of independent
components, each localized on a small group of 2 to
5 positions. Fig. 2 shows the first 24 such groups using
$k_{\max} = 120$ and $\epsilon = 0.2$, but similar results are obtained
for a range of values of $k_{\max}$ and $\epsilon$ [9]. Out of the 24
sectons shown in Fig. 2, only one, the 23rd, contains a
position that is not in contact with the others. We call
these small structural elements 'protein sectons'. Sectons
underlie the contacts inferred by DCA, but, in a
majority of cases, they consist of more than two positions.
Only a few sectons are well-recognized structural or
functional units: for the trypsin family, the top 6 sectons
thus include 4 disulfide bonds (pairs of cysteines
forming covalent bonds) [2^, and the 'catalytic triad', a
combination of three amino acids shared by several other
protein families [16]. The other sectons have not been
described previously and characterizing their structural
and/or functional roles is a clear experimental challenge.

FIG. 2: Top protein sectons in the trypsin family - Each
graph is a projection of the positions along $(U_i^{(k)}, U_i^{(k+1)})$,
the components of order $k$ and $k + 1$ obtained by rotating by
ICA the top eigenvectors of the truncated matrix of direct
information $\hat{D}_{ij}$. Sectons are defined by $S_k = \{i : U_i^{(k)} > \epsilon\}$,
with $\epsilon = 0.2$. The labeling of positions follows the numbering
system of bovine chymotrypsin (note that in several instances
positions appear as superimposed). Positions in a secton are
joined by a line when their distance is < 8 Å; by this criterion,
all sectons shown here are structurally connected, except for
$S_{23}$, where position 37 is distant from the others. The colors
are from Fig. 1, showing that sectons follow the decomposition
into sectors (non-sector positions are in white). Sectons
$S_2$, $S_3$, $S_5$, $S_6$ are disulfide bonds, and $S_4$ is the catalytic triad.

Formally, sectors and sectons can be associated with 
two exclusive parts of the spectrum of a common corre- 
lation matrix, C [5]. They are however not unrelated: 
sectons are found within or outside sectors, but not between
sectors, as seen in Fig. 2. This absence of inter-sector
sectons reflects the evolutionary independence of 
sectors. Sectons thus define elementary units of coevo- 
lution that are consistent with the overall decomposition 
into sectors. 

Sectons are similarly found in other protein families [in],
but the concept is not limited to protein structures.
As another example involving biological sequences,
but at another scale and with data of a different nature, we
consider here the problem of inferring the functional couplings
between genes in a genome. A first-order approach
to this problem is to study the co-occurrence of genes in
a large number of genomes, with, as raw data, an $M \times L$
binary matrix $x_{si}$, where $x_{si} = 1$ indicates that gene $i$
is present in the genome of species $s$, and $x_{si} = 0$ that it is
absent ($A = 1$ in this case). Building such a dataset requires
mapping corresponding genes across genomes: we
rely here on the partition of bacterial genes into clusters
of orthologous genes (COGs) [17], to obtain a dataset
consisting of $M \sim 10^3$ genomes and $L \sim 1.5 \times 10^3$ orthologous
classes of genes [5].

The same methods lead to the identification of many 
genomic sectons, the first of which are displayed in Fig. 3.
Many of these sectons have a clear structural or func- 
tional interpretation: several are different subunits of a 
same protein complex, and others are known to be in- 
volved in a common function [S]. Some sectons, however, 
cannot be easily interpreted, as they involve currently 
uncharacterized proteins: for such cases, our results pre- 
dict new functional relations. Genomic sectors, involving 
larger groups of genes, may be defined as well, although 
their significance is more difficult to assess [3]. 

The methods allow for many variations, and we presented
them only in their simplest instantiations. Many
alternative measures for scoring coevolution are for instance
possible. Among them, a correlation matrix as
in Eq. (3) but without weights ($W_i^a = 1$ for all $i$, $a$), or
the 'mutual information' [18], can be analyzed along the
same lines: interestingly, these matrices yield coevolving
elements that are intermediate between sectors and
sectons [9]. More sophisticated methods should allow
for better characterization of the patterns of coevolution
in biomolecules, in particular to account for their hierarchical
organization. New approaches will for instance
integrate phylogenetic data, or take advantage of recent
progress in inverse statistical mechanics [14] and
high-dimensional statistics [19].

In conclusion, we extended the previously reported 
decomposition of proteins into protein sectors [5] to a 
decomposition into a hierarchy of structural elements, 
with a new class of coevolving units, sectons, at its basis. 
Sectons, rather than pairs of contacting positions, may 
be the relevant target for methods aiming at inferring 
protein structures from multiple sequence alignments. 
We also provided evidence that the same pattern of
coevolution extends beyond protein structures, to the
scale of the genome. Only a small fraction of protein
or genomic sectons have been previously recognized
as fundamental units by means of other approaches.
Characterizing generally the structural, functional and
evolutionary roles of sectons is an open problem that extends
beyond the scope of statistical studies of sequence
data. In particular, experiments are needed to assess
the extent to which statistical patterns of coevolution,
inferred from a collection of sequences, are reflected in
individual biomolecules.

FIG. 3: Top genomic sectons in bacteria - As Fig. 2,
but for the co-occurrence of orthologous genes in bacterial
genomes. Each dot represents a cluster of orthologous genes
(COG), with a color associated with its functional class:
cyan for metabolism, yellow for cellular processes, magenta
for information processing, and gray for poorly characterized
genes [17]. The sectons typically comprise genes from a common
functional class, and often even subclass, which is indicated
by the last letter labeling the COGs.
SUPPLEMENTAL MATERIAL 



Preprocessing of the alignment 



As input for the identification of sectors and sectons in the trypsin family, we downloaded the full alignment 
PF00089 from Pfam (version 26.0, Nov. 2011) [1]. This alignment contains $M_0 = 14720$ sequences. It is represented 
by an array $x_{si}^a$ where $s$ labels the sequences (rows in the alignment), $i$ the positions (columns) and $a$ is a number 
between 1 and 20 (each number is associated with one of the 20 amino acids); $x_{si}^a = 1$ indicates that sequence $s$ has 
amino acid $a$ at position $i$, and $x_{si}^a = 0$ otherwise. As a reference for truncating the alignment and comparing to 
structural data, we used the sequence and structure of rat trypsin, chain E of the PDB id 3TGI in the Protein Data 
Bank [13], which consists of $L_0 = 223$ positions. 

To clean the alignment from an excess of gaps, the following operations were performed: 

(1) Truncation of positions based on the reference sequence. As the alignment does not contain the last 7 positions 
of the reference sequence, this step leaves Li = 216 positions. 

(2) Removal of sequences with a fraction of gaps exceeding $\gamma_{\rm seq} = 0.2$, or with a sequence similarity to the reference 
sequence below $S_{\min} = 0.2$, where sequence similarity is defined by 

This step leaves $M = 9589$ sequences. 

(3) Removal of positions with a fraction of gaps exceeding $\gamma_{\rm pos} = 0.2$. The frequencies of gaps are computed with 

sequence weights as defined by Eq. ([s]), using S^J . This step leaves L = 204 positions. 

The parameters $\gamma_{\rm seq} = 0.2$, $\gamma_{\rm pos} = 0.2$ and $S_{\min} = 0.2$ are chosen to mitigate the effects of gaps, but the results 
are not sensitive to their exact values. A more in-depth analysis of the structure of sequence correlations can reveal 
further information, and may suggest the removal of additional sequences, but this analysis is beyond the scope of 
the present study [20]. 



Sequence-weighted frequencies 

Following Ref. [S], the uneven sampling of sequences is alleviated by introducing sequence weights defined by 

$$w_s = \frac{1}{\nu_s}, \quad \text{with } \nu_s = |\{r : S_{rs} > \delta\}|, \qquad (8)$$

i.e., $\nu_s$ counts the number of sequences $r$ within distance $\delta$ of sequence $s$, where the distance between two sequences 
is the fraction of amino acids by which they differ: 



L 

$M' = \sum_s w_s$ can be interpreted as an effective number of sequences in the alignment. Here we take $\delta = 0.8$, which 
results in $M' \simeq 4600$ effective sequences for the trypsin alignment. 

The sequence weights are used to define frequencies as 

$$f_i^a = \frac{1}{M'} \sum_s w_s x_{si}^a, \qquad f_{ij}^{ab} = \frac{1}{M'} \sum_s w_s x_{si}^a x_{sj}^b. \qquad (10)$$
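
The weighting scheme can be sketched as follows. This is a minimal illustration under an explicit assumption: the similarity $S_{rs}$ between two sequences is taken here as the fraction of positions at which they carry the same amino acid (gaps never match), which may differ in detail from the definition used in the original analysis. The function names are chosen for the example.

```python
import numpy as np

def sequence_weights(x, delta=0.8):
    """Weights w_s = 1 / nu_s, with nu_s the number of sequences r
    whose similarity S_rs to sequence s exceeds delta."""
    M, L, A = x.shape
    xf = x.reshape(M, L * A).astype(float)
    S = xf @ xf.T / L                   # assumed similarity: fraction of matches
    nu = (S > delta).sum(axis=1)
    nu = np.maximum(nu, 1)              # guard for gap-rich sequences
    return 1.0 / nu

def weighted_frequencies(x, w):
    """Weighted single-site and pairwise frequencies, and M' = sum of weights."""
    xf = x.astype(float)
    M_eff = w.sum()
    f1 = np.einsum('s,sia->ia', w, xf) / M_eff
    f2 = np.einsum('s,sia,sjb->iajb', w, xf, xf) / M_eff
    return f1, f2, M_eff
```

Two identical sequences each receive a weight 1/2 and therefore count together as a single effective sequence.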



Rotation by independent component analysis (ICA) 

Different implementations of ICA use different measures of independence and different algorithms for optimizing 
them. Here, we use one of the simplest implementations of ICA, proposed by Bell and Sejnowski [11], with modifications
introduced by Amari [S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal 
separation. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in neural information processing systems,
volume 8, pages 757-763, Cambridge MA, 1996. MIT Press.]. We take as input the top $k_{\rm top}$ eigenvectors of the 
correlation matrix $\tilde{C}_{ij}$ or $D_{ij}$, which we concatenate in a $k_{\rm top} \times L$ matrix $Z$ (at variance with usual implementations 
of ICA using as input the dataset $X$). The algorithm iteratively updates an unmixing matrix $W$, starting from the 
$k_{\rm top} \times k_{\rm top}$ identity matrix $W_0 = I_{k_{\rm top}}$, with increments $\Delta W$ given by 

The parameter $\eta$ is a learning rate that has to be sufficiently small for the iterations to converge; in this study, 
T] = 10"'' led to convergence after 10'* iterations in every case. 

The independent components $V^{(k)}$ (or $U^{(k)}$) are obtained by applying $W$ to the eigenvectors in $Z$. To set their 
overall scale and sign, we normalize them to unit length ($\sum_i (V_i^{(k)})^2 = 1$) and orient them so that the position $i$ with 
largest $|V_i^{(k)}|$ satisfies $V_i^{(k)} > 0$. The order of the independent components, which is generally not prescribed by ICA, 
is here well defined by the algorithm and is related to the order of the principal components. 
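
A rotation of this kind can be sketched as follows. The update rule used here is the infomax rule with a logistic nonlinearity in its natural-gradient form, $\Delta W = \eta\,[I + (1 - 2 g(Y)) Y^T]\, W$ with $Y = W Z$, which is an assumption about the precise variant; the default learning rate and number of iterations are placeholders, and the function name `rotate_ica` is hypothetical.

```python
import numpy as np

def rotate_ica(Z, eta=1e-4, n_iter=10000):
    """Rotate the rows of Z (k x L matrix of eigenvectors) by ICA.

    Assumed update: infomax with logistic nonlinearity, natural gradient.
    Returns the components, normalized to unit length and oriented so
    that the entry of largest absolute value is positive.
    """
    k = Z.shape[0]
    W = np.eye(k)                                  # W_0 is the identity
    for _ in range(n_iter):
        Y = W @ Z
        g = 1.0 / (1.0 + np.exp(-Y))               # logistic function
        W = W + eta * (np.eye(k) + (1.0 - 2.0 * g) @ Y.T) @ W
    V = W @ Z
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    for row in V:                                  # rows are views into V
        if row[np.argmax(np.abs(row))] < 0:
            row *= -1.0
    return V
```

In practice `Z` would be obtained, for instance, from `numpy.linalg.eigh` applied to the matrix under study, keeping the eigenvectors of largest eigenvalues as rows.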



Threshold $k_{\rm top}$ in defining sectors 

The spectrum of $\tilde{C}_{ij}$, displayed in Fig. S1, indicates that between 4 and 8 eigenvalues are emerging from a bulk of 
small eigenvalues. This estimation is confirmed by comparing with the spectra of randomized alignments, where the 
amino acids are drawn independently at each position $i$ according to the frequencies $f_i^a$ so as to remove the correlations 
but preserve the distribution of amino acids at each position. 

In the main text, we presented the results when selecting $k_{\rm top} = 4$ modes. A smaller number of components may 
prevent the discrimination between sectors, as shown in Fig. S2, where taking $k_{\rm top} = 3$ causes the red and blue sectors 
to appear along a same component. Conversely, a larger number of components may lead to the splitting of a sector 
into disconnected subsets, as shown in Fig. S3, where $k_{\rm top} = 5$ decomposes the purple sector along two components. 
In this case, the two components do not define two new sectors, but rather indicate a partition of the sector into a 
core and a periphery, as shown in Fig. S4. As increasing values of $k_{\rm top}$ are considered, sectors break up successively 
into smaller components, as seen in Fig. S5 with $k_{\rm top} = 24$. 



Threshold $\epsilon$ in defining sectors 

Besides the threshold $k_{\rm top}$ for the eigenvalues, the definition of sectors involves a threshold $\epsilon$ determining which 
positions contribute significantly to each component. Here again, $\epsilon$ can be estimated from a comparison with 
randomized alignments, but it is more interesting to notice that several values of $\epsilon$ are consistent with structurally 
connected sectors. Varying $\epsilon$ thus defines a hierarchy of structurally connected positions, from the core of the most 
conserved positions for large $\epsilon$ to a periphery of less conserved positions for smaller $\epsilon$. 

As an illustration of this feature, we show in Fig. S6 how the connectivity of each sector, measured by the relative 
size of its largest structurally connected subset, varies with $\epsilon$. With the possible exception of the cyan sector, the 
sectors are found to be significantly structurally connected for nearly all values of $\epsilon$. The significance of this finding 
is assessed by comparing with randomly-formed groups of positions, or with the positions ordered by their overall 
degree of conservation. 
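
The connectivity measure described above can be computed by a simple graph traversal. The sketch below assumes a precomputed matrix `dist` of pairwise distances between positions (how distances between residues are measured is not specified here and is left to the user) and uses the contact criterion of 8 Å from the main text; the function name is chosen for the example.

```python
def largest_connected_fraction(positions, dist, cutoff=8.0):
    """Relative size of the largest structurally connected subset.

    positions : iterable of position indices forming the group
    dist      : dist[u][v] is the distance between positions u and v
    cutoff    : two positions are in contact when dist[u][v] < cutoff
    """
    positions = list(positions)
    if not positions:
        return 0.0
    unvisited = set(positions)
    best = 0
    while unvisited:
        # Explore one connected component by depth-first search.
        stack = [unvisited.pop()]
        size = 1
        while stack:
            u = stack.pop()
            neighbors = [v for v in unvisited if dist[u][v] < cutoff]
            for v in neighbors:
                unvisited.remove(v)
            stack.extend(neighbors)
            size += len(neighbors)
        best = max(best, size)
    return best / len(positions)
```

A value of 1 means that the whole group forms a single connected cluster on the structure.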



Intersections between sectors 

In Fig. 1, we sidestepped the discussion of positions at the intersection between different $S_k = \{i : V_i^{(k)} > \epsilon\}$ by excluding 
them from the definition of sectors. These few positions are however structurally meaningful as well: Fig. S7 shows 
that they are always found at the periphery of one of the two sectors, and, in several instances, at the structural 
interface between them. 
Composition of sectors 



Fig. S8 reports the exact composition of the 4 sectors defined in the main text. The green and red sectors have 
very significant overlap with the green and red sectors previously defined in Ref. [5] using a smaller alignment. On 
the other hand, the purple and cyan sectors have only limited overlap with the blue sector defined in this previous 
study. As a visual representation of the relation between these two definitions, Fig. S9 reproduces Fig. 1 but with 
colors corresponding to the 3 sectors defined in Ref. [5]. 



Contacts and sectons 



In computing the matrix of direct information $D_{ij}$ we follow the DCA of Ref. [7], with only minor differences: 
(1) we trim the alignment from sequences with significant dissimilarity to the reference sequence, as part of the 
preprocessing steps; (2) we use background frequencies $q^a$ computed from averages over the alignment, rather than 
$q^a = 1/(A + 1)$, for consistency with SCA, and with no major consequence; (3) we regularize/shrink the covariance 
matrix $C$ instead of using pseudo-counts for the frequencies, which also has minor consequences. Finally, we truncate 
$D_{ij}$ along its diagonal, which has no effect on the prediction of contacts when ranking pairs by their value of direct 
information if restricting to non-trivial contacts, defined as those at distance $> \Delta = 5$ along the chain. Fig. S10 
reports the performance of DCA for predicting contacts for the alignment under study: with only one exception, 
all top 33 predicted non-trivial contacts are actual contacts. Fig. S11 gives the list of the top 80 non-trivial 
predictions of contacts and indicates the secton to which they belong. Fig. S12 shows the top 24 sectons of Fig. 2 on 
the three-dimensional structure of trypsin. 



Fig. S14 shows that ~90% of the top 60 sectons are structurally connected. It also shows that, as for sectors, the 
results are insensitive to the choice of the threshold $\epsilon$ used in defining the positions contributing significantly to a 
component. In all figures involving sectons, we use $k_{\rm top} = 120$, to be able to consider a large number of sectons, but 
the results are here also insensitive to this exact value. Finally, Fig. S13 gives the composition of the top 120 sectons 
and indicates which are connected. 



Not truncating $D_{ij}$ leads to the sectons shown in Fig. S15. Many of these sectons consist of consecutive positions. 
These trivial sectons are induced by gaps, which tend to be consecutive along the sequence (this feature is partly a 
consequence of multiple sequence alignment algorithms, which have a penalty for opening new gaps). Truncating 
$D$ (or equivalently $J$) is a simple if not optimal way of getting rid of these trivial correlations. 



Orthogonal decomposition of the correlations 



To show how sectors and sectons can be derived from two distinct parts of a common correlation matrix, we take 
here the regularized correlation matrix $\bar{C}$ and decompose it into two orthogonal parts $C^+$ and $C^-$. More precisely, if 
$\bar{C} = \sum_k \lambda_k |k\rangle\langle k|$ denotes the spectral decomposition of $\bar{C}$ in the bra-ket notation, with $\lambda_1 > \dots > \lambda_{AL}$, we define 

$$C^+ = \sum_{k \le k^*} \lambda_k |k\rangle\langle k| \quad \text{and} \quad C^- = \sum_{k > k^*} \lambda_k |k\rangle\langle k|. \qquad (12)$$

We then apply SCA to $C^+$ instead of $C$, and DCA to $J^- = -\sum_{k > k^*} \lambda_k^{-1} |k\rangle\langle k|$ instead of $J = -\bar{C}^{-1}$. For $k^* = 20$, 
Fig. S16 shows that the same sectors are recovered from $C$ and $C^+$, and Fig. S17 that the same sectons are recovered 
from $C$ and $C^-$. 

Random matrix theory indicates that both ends of the spectra of under-sampled empirical covariance matrices are 
statistically significant. The few top modes support the sectors. The bottom modes are essential for defining sectons 
(and therefore for inferring contacts by DCA) but they are not sufficient: sectons are not exclusively associated with 
the bottom modes of $C_{ij}^{ab}$, as also suggested by Figs. S5 and S18 (but they are, by definition, associated with the top 
modes of $D_{ij}$). Thus, the number of modes included in the definition of $C^-$ cannot be reduced significantly without 
altering the nature of the sectons recovered from it, even though most of the $20L = 4080$ modes of $C$ may not be 
statistically significant. 
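
The orthogonal decomposition of Eq. (12) can be sketched with a standard symmetric eigendecomposition. The sketch assumes that the regularized matrix is given as a symmetric, invertible `(AL) x (AL)` array and that the top part contains the `k_star` largest eigenvalues; the function name `split_spectrum` is chosen for the example.

```python
import numpy as np

def split_spectrum(C_bar, k_star=20):
    """Split a symmetric matrix into top and bottom spectral parts.

    Returns C_plus (k_star largest eigenvalues), C_minus (the rest) and
    J_minus = -sum over bottom modes of |k><k| / lambda_k.
    """
    lam, vec = np.linalg.eigh(C_bar)        # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # reorder: largest first
    lam, vec = lam[order], vec[:, order]
    top, bottom = slice(0, k_star), slice(k_star, None)
    C_plus = (vec[:, top] * lam[top]) @ vec[:, top].T
    C_minus = (vec[:, bottom] * lam[bottom]) @ vec[:, bottom].T
    J_minus = -(vec[:, bottom] / lam[bottom]) @ vec[:, bottom].T
    return C_plus, C_minus, J_minus
```

By construction the two parts sum to the input matrix and are mutually orthogonal, so that analyses applied to one part cannot draw on the modes of the other.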
Unweighted correlations and mutual information 



The operations applied to $\tilde{C}_{ij}$ and $D_{ij}$ to respectively obtain sectors and sectons, i.e., extraction and rotation by 
ICA of the top eigenvectors, can be applied to other measures of correlations. In particular, they can be applied to the 
matrix of unweighted correlations obtained from Eq. (3) by taking flat positional weights $W_i^a = 1$ for all $i, a$. 
Groups of coevolving positions of size intermediate between sectors and sectons, and consistent with both, are thus 
defined, as shown in Fig. S18. 



The matrix of mutual information $M_{ij}$ can be analyzed similarly. This matrix is defined by 

$$M_{ij} = \sum_{a,b=0}^{20} f_{ij}^{ab} \ln \frac{f_{ij}^{ab}}{f_i^a f_j^b}, \qquad (13)$$

where $a = 0$ corresponds to a gap, so that $f_i^0 = 1 - \sum_{a=1}^{20} f_i^a$, and similarly for $f_{ij}^{a0}$ and $f_{ij}^{00}$. Fig. S19(A) shows 
that its top components do report structurally connected groups of positions, but most of them are trivial, i.e., consist 
of consecutive positions along the sequence. As for direct information, we can however truncate a band along the 
diagonal of this matrix to obtain non-trivial groups of correlated positions. These groups of coevolving positions 
are again both structurally connected and consistent with the decomposition into sectors and sectons, as shown in 
Fig. S19(B). 
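
Eq. (13) can be evaluated directly from the frequencies. The sketch below assumes that the gap has already been included as an extra state, so that the frequency arrays have a last dimension of size $A + 1$ and are normalized; terms with vanishing joint frequency are set to zero, following the usual convention $0 \ln 0 = 0$. The function name is chosen for the example.

```python
import numpy as np

def mutual_information(f1, f2):
    """Mutual information M[i, j] between pairs of positions.

    f1 : single-site frequencies, shape (L, A+1), gap state included
    f2 : joint frequencies, shape (L, A+1, L, A+1)
    """
    prod = f1[:, :, None, None] * f1[None, None, :, :]
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(f2 > 0, f2 * np.log(f2 / prod), 0.0)
    return terms.sum(axis=(1, 3))            # sum over the states a and b
```

The diagonal entry `M[i, i]` reduces to the entropy of position `i`, which is one reason why a band around the diagonal is discarded before further analysis.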



Co-occurrence of orthologous genes in bacterial genomes 



Sequenced bacterial genomes and COG annotations were downloaded from NCBI. The initial dataset contained 
$M_0 = 1432$ genomes and $L_0 = 4467$ COGs. 

The following cleaning steps were conducted: 

(1) Removal of 'exceptional' genomes with size below 500 kbp or with no less than 60 % of genes annotated by 
COGs. This step leaves $M = 1108$ genomes. 

(2) Removal of COGs that are present in less than $\gamma_c = 0.4$ of the genomes, where gene frequencies are computed 
with sequence weights using $\delta = 0.9$. The relatively high value $\gamma_c = 0.4$ is meant to reduce the data to a size that 
is easily tractable computationally; here it leaves $L = 1474$ COGs. Conversely, the choice of $\delta = 0.9$ is meant to 
preserve a relatively high effective number of genomes, here $M' = 380$. The exact values of these parameters are 
however not crucial. 



The data is represented by an $M \times L$ binary array $x_{si}$ with $x_{si} = 1$ if genome $s$ has at least one gene in COG $i$, 
and $x_{si} = 0$ otherwise. The average occurrence of genes is $q = \sum_{si} x_{si} / (ML) = 0.67$. This dataset is in no way meant to 
be optimal, and finer definitions of orthology are possible. Our point here is to show that sectors and sectons can 
be unraveled even from a relatively crude construction of bacterial phylogenetic profiles, leaving for future work the 
study of more elaborated datasets. 



Genomic sectons are obtained exactly as protein sectons, except that the alphabet is now binary ($A = 1$),
sequence weights are computed with $\delta = 0.9$, and $\mathcal{D}_{ij}$ is not truncated ($\Delta = 0$). The content of the first 100 sectons
(obtained with $k_{\rm top} = 120$) is reported in Fig. S20, with further details for the top 24 sectons provided in Fig. S23.



Genomic sectors can also be defined following the methods for defining protein sectors. Fig. S21 shows the counterpart
of Fig. 1, using here $k_{\rm top} = 6$. In the absence of a counterpart for the experimentally determined three-dimensional
structure of proteins, assessing the relevance of these sectors is not obvious. Using the partition of COGs into 3 broad
functional classes [17], we nevertheless find that 4 of the 6 components support groups of COGs that are significantly
enriched in some of these classes, as reported in Fig. S22. This suggests that genome sectors may be defined as well,
which contain co-functional and therefore coevolving genes.
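The enrichment is assessed in Fig. S22 with a chi-square test with 2 degrees of freedom. A minimal sketch of such a goodness-of-fit test is given here, assuming that the observed counts of a sector in the 3 functional classes are compared to the counts expected from background fractions. The function name and the counts used in the example are hypothetical. For 2 degrees of freedom the p-value has the closed form $\exp(-\chi^2/2)$, so no statistical library is needed.

```python
import math


def chi_square_three_classes(observed, background):
    """Chi-square goodness-of-fit test for counts in 3 classes.

    observed: counts of the sector in each of the 3 classes.
    background: expected fraction of each class (should sum to 1).
    Returns (chi2, p_value) for 2 degrees of freedom.
    """
    if len(observed) != 3 or len(background) != 3:
        raise ValueError("exactly 3 classes are expected")
    total = sum(observed)
    expected = [total * fraction for fraction in background]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square law with 2 degrees of freedom.
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


# Hypothetical sector of 20 COGs falling entirely in the first class.
chi2_example, p_example = chi_square_three_classes([20, 0, 0], [0.5, 0.25, 0.25])
```

A sector whose composition matches the background exactly gives $\chi^2 = 0$ and a p-value of 1.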



[Histogram; horizontal axis: Eigenvalues]



FIG. S1: Spectrum of $\tilde{C}_{ij}$ for the trypsin family - In blue, histogram of the $L$ eigenvalues of the matrix $\tilde{C}_{ij}$ (truncated to 20
along the $y$-axis). In red, average spectrum over 100 randomized alignments where the amino acids are drawn independently
at each position $i$ according to the frequencies $f_i^a$. This shows that between 4 and 8 eigenvalues may be considered as significant.

FIG. S2: Independent components from rotation by ICA of the top $k_{\rm top} = 3$ eigenvectors of $\tilde{C}_{ij}$ - This figure is the counterpart
of Fig. 1 for $k_{\rm top} = 3$ instead of $k_{\rm top} = 4$, and with positions colored as in Fig. 1. The same green and red sectors are defined
along two of the three components, but the red and cyan sectors appear together along the third.




FIG. S3: Independent components from rotation by ICA of the top $k_{\rm top} = 5$ eigenvectors of $\tilde{C}_{ij}$ - This figure is the counterpart
of Fig. 1 for $k_{\rm top} = 5$ instead of $k_{\rm top} = 4$, and with positions colored as in Fig. 1. The definitions of the green, red and cyan
sectors along $V^{(2)}$, $V^{(3)}$ and $V^{(4)}$ are consistent with their definition with $k_{\rm top} = 4$, while the new component decomposes the
purple sector in two subsets, whose interpretation is given in Fig. S4.

FIG. S4: Interpretation of the decomposition of the purple sector when considering $k_{\rm top} = 5$ - (A) Same as the last graph of
Fig. S3, except that the positions from the purple sector are indicated with three different colors, depending on whether their
contribution exceeds 0.2 along one (pink) or the other (purple) of the two components that split the purple sector, or along neither (light pink). (B) Using this coloring scheme but now for the projection
along components obtained with $k_{\rm top} = 4$ as in Fig. 1, showing that the partition of the purple sector corresponds to a partition
between its core, defined as the positions with highest contribution to $V^{(1)}$, and the others. (C) Location of the positions on
the three-dimensional structure, showing that the core is structurally connected with the other positions around it.

FIG. S5: Independent components from rotation by ICA of the top $k_{\rm top} = 24$ eigenvectors of $\tilde{C}_{ij}$ - This figure is the counterpart
of Fig. 1 for $k_{\rm top} = 24$ instead of $k_{\rm top} = 4$, and with positions colored as in Fig. 1. All sectors are now split up into smaller units
that prefigure the sectons. Lines between the positions indicate structural contacts (for clarity, these contacts are represented
only for positions with contribution > 0.2 along each component).

[Horizontal axes: $\epsilon$, $\epsilon$, rank]

FIG. S6: Connectivity of the sectors for varying values of $\epsilon$ - (A) Fraction of the positions in the largest structurally connected
subset of a sector for varying values of $\epsilon$ and for sectors defined as in the main text by $S'_k = \{i : V_i^{(k)} > \epsilon,\ V_i^{(l)} < \epsilon,\ l \neq k\}$
($k = 1, \dots, k_{\rm top} = 4$). The colors correspond to those of Fig. 1: magenta for $k = 1$, green for $k = 2$, red for $k = 3$ and blue for
$k = 4$. (B) Same as (A) but not excluding intersections, which are significant for small $\epsilon$, i.e., taking $S_k = \{i : V_i^{(k)} > \epsilon\}$. (C)
To assess the significance of (A) and (B), we consider also two other groups of ordered positions: randomly ranked positions and
positions ranked by degree of conservation, measured by the relative entropy $D_i = \sum_{a=0}^{20} f_i^a \ln (f_i^a / q^a)$, with $f_i^0 = 1 - \sum_{a=1}^{20} f_i^a$
and $q^0 = 1 - \sum_{a=1}^{20} q^a$ the frequencies of gaps. The results are presented here as a function of the size of the groups, where
positions are added according to their rank (given by $V_i^{(k)}$ for the sectors). Compared to randomly ranked positions (full black
line), sectors are clearly significantly more connected. They are also more connected than positions ranked by conservation
(dashed black line), with the possible exception of the cyan sector (blue line).




FIG. S7: Intersections between $S_k$ - In the main text, sectors are defined by $S'_k = \{i : V_i^{(k)} > \epsilon,\ V_i^{(l)} < \epsilon,\ l \neq k\}$
($k = 1, \dots, k_{\rm top} = 4$, $\epsilon = 0.1$). Here we represent on the three-dimensional structure the few positions that are excluded
as ambiguous, i.e., belonging to intersections $S_k \cap S_l$ for $k \neq l$, with $S_k = \{i : V_i^{(k)} > \epsilon\}$. There are 7 such positions, all of
which are at the structural periphery of one of the two sectors $S'_k$ or $S'_l$, including 4 that are at the structural interface between
the two. (A position from the green sector appears as disconnected in (A), but the position that could connect it to the rest
of the green sector is actually not included in the alignment).
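
The two definitions of sectors used in these captions, with and without exclusion of intersections, can be sketched as follows. The layout of the array `V` (one list of contributions per component) and the function name are assumptions made for the illustration.

```python
def define_sectors(V, eps):
    """Sectors from independent components.

    V[k][i] is the contribution of position i to component k.
    Returns (exclusive, ambiguous) where exclusive[k] contains the positions
    with V[k][i] > eps and V[l][i] < eps for every other component l, and
    ambiguous contains the positions above eps along two or more components.
    """
    n_components, n_positions = len(V), len(V[0])
    # Sets without exclusion of intersections.
    raw = [{i for i in range(n_positions) if V[k][i] > eps}
           for k in range(n_components)]
    ambiguous = set()
    for k in range(n_components):
        for l in range(k + 1, n_components):
            ambiguous |= raw[k] & raw[l]
    # Sets with exclusion of intersections.
    exclusive = []
    for k in range(n_components):
        sector = {i for i in raw[k]
                  if all(V[l][i] < eps for l in range(n_components) if l != k)}
        exclusive.append(sector)
    return exclusive, ambiguous


# Toy example with 2 components and 3 positions; position 1 is ambiguous.
toy_V = [[0.30, 0.20, 0.00],
         [0.05, 0.15, 0.40]]
toy_sectors, toy_ambiguous = define_sectors(toy_V, eps=0.1)
```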

FIG. S8: Composition of the sectors defined in Fig. 1, ranked by the corresponding value of $V_i^{(k)}$ - A star indicates for the
purple and cyan sectors that the position belongs to the blue sector defined in Ref. [5], and for the green and red sectors that
it belongs respectively to the green and red sector of Ref. [5].

FIG. S9: Comparison with the sectors defined in Ref. [5] - The graphs are identical to those of Fig. 1(A), except that the
positions are here colored according to the definition of the 3 sectors of Ref. [5]. This shows as in Fig. S8 that essentially the
same green and red sectors are identified, while the purple and cyan sectors have a small overlap with the previously defined
blue sector.

FIG. S10: Fraction of pairs of positions predicted to be in contact from the top values of the matrix of direct information
$\mathcal{D}_{ij}$ - Pairs $ij$ of positions are ordered by decreasing values of $\mathcal{D}_{ij}$, and the fraction of top $n$ pairs to be in structural contact
(distance < 8 Å) is considered as a function of $n$. Different curves correspond to different values of $\Delta$, where pairs at distance
< $\Delta$ along the sequence are ignored.
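
The quantity plotted in this figure can be computed along the following lines. This is a sketch with hypothetical names: `scores` stands for the matrix of direct information, `distances` for the matrix of distances in the three-dimensional structure (in Å), and `min_separation` for the cutoff along the sequence below which pairs are ignored.

```python
def contact_fraction(scores, distances, n_top, min_separation, cutoff=8.0):
    """Fraction of the n_top highest-scoring pairs that are in contact.

    Pairs (i, j) with j - i < min_separation are ignored. A pair is in
    contact when its structural distance is below cutoff. Returns 0.0 when
    no pair satisfies the separation criterion.
    """
    length = len(scores)
    pairs = [(scores[i][j], i, j)
             for i in range(length)
             for j in range(i + 1, length)
             if j - i >= min_separation]
    # Rank pairs by decreasing score and keep the top ones.
    pairs.sort(key=lambda item: item[0], reverse=True)
    top = pairs[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if distances[i][j] < cutoff)
    return hits / len(top)


# Toy example with 4 positions (symmetric matrices, made-up values).
toy_scores = [[0.0, 1.0, 0.9, 0.5],
              [1.0, 0.0, 1.0, 0.7],
              [0.9, 1.0, 0.0, 1.0],
              [0.5, 0.7, 1.0, 0.0]]
toy_distances = [[0.0, 4.0, 5.0, 6.0],
                 [4.0, 0.0, 4.0, 10.0],
                 [5.0, 4.0, 0.0, 4.0],
                 [6.0, 10.0, 4.0, 0.0]]
toy_fraction = contact_fraction(toy_scores, toy_distances, n_top=2, min_separation=2)
```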

FIG. S11: Top 80 non-trivial contacts and their associated secton - The pairs are ordered by decreasing values of $\mathcal{D}_{ij}$, with
their rank indicated in the first column. Only pairs of positions at distance > $\Delta = 5$ along the sequence are considered, and
many pairs are therefore not included (red curve in Fig. S10). The second column indicates the rank of the secton where
the pair is found and the third the distance between the positions in the three-dimensional structure. Pairs are considered in
contact when this distance is < 8 Å, and false positives are indicated in red. Note that most of these false positives are actually
not in a secton: if considering the 56 pairs that are both among the top 80 non-trivial pairs and in one of the top 120 sectons
(all those shown here), only 2 are false positives (96% positive rate), while out of the top 56 non-trivial pairs 7 are false positives
(87% positive rate); this suggests that integrating sectons into the prediction of contacts may lead to an increased positive rate.




FIG. S12: Structural representation of the top 24 sectons - Representation on the three-dimensional structure of rat trypsin
of the top 24 sectons defined in Fig. 2. The colors refer to the sectors, with yellow for non-sector positions. Only secton 23
contains a position that is disconnected from the others when taking a distance < 8 Å as criterion for physical connectivity.

FIG. S13: Composition of sectons - Sectons are defined here with $k_{\rm top} = 120$ and $\epsilon = 0.2$. A star indicates that the secton is
structurally connected. Up to rank 40, all but one secton are thus connected. Note that at large rank, some sectons consist of
single positions and are therefore trivially connected; see also Fig. S14.




[Horizontal axes: ranked sectons (A), ranked non-singleton sectons (B)]

FIG. S14: Connectivity of sectons - Sectons are here defined for $k_{\rm top} = 120$ and different values of $\epsilon$. A secton is considered
as structurally connected if all the positions that it contains are in contact in the three-dimensional structure, either directly
or indirectly. (A) Fraction of the top sectons to be structurally connected as a function of the rank up to which sectons are
considered. For comparison, the black curve corresponds to the top 2 positions contributing to $U^{(k)}$. (B) Same as (A) but
excluding the sectons that contain a single position and are therefore trivially connected. These figures show that by considering
a larger, more stringent threshold $\epsilon$ than in Fig. S13 (where $\epsilon = 0.2$) more (but smaller) sectons are found to be connected.
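
The criterion of structural connectivity, where positions may be in contact directly or indirectly, amounts to asking whether the positions form a single connected component of the contact graph. A minimal sketch using a breadth-first search is given here; the function name, the layout of `distances` (a matrix of structural distances in Å) and the default cutoff of 8 Å taken from the contact definition are assumptions of the illustration.

```python
from collections import deque


def is_structurally_connected(positions, distances, cutoff=8.0):
    """True if the positions form one connected component of the contact graph.

    Two positions are linked when their structural distance is below cutoff.
    Groups with zero or one position are trivially connected.
    """
    nodes = list(set(positions))
    if len(nodes) <= 1:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    # Breadth-first search restricted to the positions of the group.
    while queue:
        i = queue.popleft()
        for j in nodes:
            if j not in seen and distances[i][j] < cutoff:
                seen.add(j)
                queue.append(j)
    return len(seen) == len(nodes)


# Toy example: 0 and 2 are not in direct contact but are linked through 1.
toy_dist = [[0.0, 5.0, 12.0, 30.0],
            [5.0, 0.0, 6.0, 30.0],
            [12.0, 6.0, 0.0, 30.0],
            [30.0, 30.0, 30.0, 0.0]]
chain_connected = is_structurally_connected([0, 1, 2], toy_dist)
```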

FIG. S15: Sectons obtained without truncating $\mathcal{D}_{ij}$ - Out of the 24 sectons displayed here, 5 consist exclusively of consecutive
positions and may be attributed to consecutive gaps.

FIG. S16: Sectors from the top 20 modes of C - We compare here the top $k_{\rm top} = 4$ independent components derived from C
and C'^ to show that they are highly correlated and therefore define the same sectors.

FIG. S17: Sectons from the bottom modes of C - We show here the top 24 sectons derived from C. A comparison with Fig. 2
shows that essentially the same sectons are defined by this matrix and by C.

FIG. S18: Sector analysis from the unweighted correlation matrix $C_{ij}$ - Uniform positional weights are used ($w_i^a = 1$ for all $i$
and $a$) and the top $k_{\rm top} = 12$ components are considered. Lines between the positions indicate structural contacts (for clarity,
these contacts are represented only for the top positions along each component). The groups of coevolving positions defined
by the components are intermediate between sectors and sectons. As far as the top 6 components are concerned, components
$k = 2$ and $k = 6$ thus correspond to the cores of the red and green sectors, while components 3, 4 and 5 correspond to the
sectons 13, 9 and 8 defined in Fig. 2, but with one or more additional positions (component 1 is here not localized).

FIG. S19: Sector analysis from the matrix of mutual information $M_{ij}$ - (A) Similar to Fig. S18 but using $M_{ij}$ instead of $C_{ij}$.
(B) After truncating $M_{ij}$ from a band in its diagonal, i.e. using $\tilde{M}_{ij}$ with $\tilde{M}_{ij} = M_{ij}$ when $|i - j| > \Delta$ and $\tilde{M}_{ij} = 0$ otherwise.
The truncation reveals more structurally connected groups of positions that are not consecutive.

FIG. S20: Content of the top 100 genomic sectons in terms of COGs - The full content of secton 32 is {1677NU 1684NU 1766NU
1157NU 1815N 1843N 1868N 1256N 1291N 1298NU 1338NU 1344N 1987NU 1377NU 1536N 1558N} and the full content of
secton 57 is {1663M 1043M 1212M 2877M 763M 774M 1519M 794M 1560M}.

FIG. S21: Genomic sectors ($k_{\max} = 6$) - COGs are colored as in Fig. 3 according to the functional category to which
they belong: cyan for metabolism, yellow for cellular processes, magenta for information processing, and white for poorly
characterized genes or genes that belong to multiple categories [17]. The apparent enrichment of some components in yellow or
cyan COGs is quantitatively estimated in Fig. S22.

Sector                              p-value for the composition
1          22      1      4     13      0.14
2          56      5     25     17      5 × 10^(-…)
3          74      2     35     23      5 × 10^(-…)
4         114      8     10     85      3 × 10^(-…)
5          50     12      7     24      0.37
6          65      2      4     32      3 × 10^(-…)



FIG. S22: Association between genomic sectors and functional categories - For each sector, defined as the COGs with
contribution > 0.05 along one of the 6 components $V^{(k)}$ shown in Fig. S21 and not along any other, we assessed the significance
of their content in information processing (magenta), cellular processing (yellow) and metabolic (cyan) COGs, which represent
respectively around 1/2, 1/4 and 1/4 of the COGs. The p-value is computed from a chi-square test with 2 degrees of freedom.
4 of the 6 sectors can be considered to be significantly enriched in COGs of some of these functional categories.



Secton  COG     Annotation

2       1006P   Multisubunit Na+/H+ antiporter, MnhC subunit
        1863P   Multisubunit Na+/H+ antiporter, MnhE subunit
        1320P   Multisubunit Na+/H+ antiporter, MnhG subunit
        651CP   Formate hydrogenlyase subunit 3/Multisubunit Na+/H+ antiporter, MnhD subunit

3       1984E   Allophanate hydrolase subunit 2
        2049E   Allophanate hydrolase subunit 1
        1540R   Uncharacterized proteins, homologs of lactam utilization protein B

4       559E    Branched-chain amino acid ABC-type transport system, permease components
        410E    ABC-type branched-chain amino acid transport systems, ATPase component
        411E    ABC-type branched-chain amino acid transport systems, ATPase component
        4177E   ABC-type branched-chain amino acid transport system, permease component
        683E    ABC-type branched-chain amino acid transport systems, periplasmic component

5       4603R   ABC-type uncharacterized transport system, permease component
        3845R   ABC-type uncharacterized transport systems, ATPase components
        1079R   Uncharacterized ABC-type transport system, permease component
        1744R   Uncharacterized ABC-type transport system, periplasmic component/surface lipoprotein

6       1135P   ABC-type metal ion transport system, ATPase component
        2011P   ABC-type metal ion transport system, permease component
        1464P   ABC-type metal ion transport system, periplasmic component/surface antigen

FIG. S23: Annotations for the top genomic sectons - Note that there is no secton along the first component, as seen in Fig. 3.
(Sectons 7 to 24 are presented in the continuation of the table.) This shows that the definition of sectons is consistent with our current
knowledge of gene functions.

Secton  COG     Annotation

7       1010H   Precorrin-3B methylase
        1797H   Cobyrinic acid a,c-diamide synthase
        2875H   Precorrin-4 methylase
        2082H   Precorrin isomerase
        2242H   Precorrin-6B methylase 2
        2243H   Precorrin-2 methylase

8       1319C   Aerobic-type carbon monoxide dehydrogenase, middle subunit CoxM/CutM homologs
        1975O   Xanthine and CO dehydrogenases maturation factor, XdhC/CoxF family
        2068R   Uncharacterized MobA-related protein
        2080C   Aerobic-type carbon monoxide dehydrogenase, small subunit CoxS/CutS homologs
        1529C   Aerobic-type carbon monoxide dehydrogenase, large subunit CoxL/CutL homologs

9       1176E   ABC-type spermidine/putrescine transport system, permease component I
        1177E   ABC-type spermidine/putrescine transport system, permease component II
        687E    Spermidine/putrescine-binding periplasmic protein

10      286V    Type I restriction-modification system methyltransferase subunit
        610V    Type I site-specific restriction-modification system, R (restriction) subunit and related helicases
        732V    Restriction endonuclease S subunits

11      1005C   NADH:ubiquinone oxidoreductase subunit 1 (chain H)
        1007C   NADH:ubiquinone oxidoreductase subunit 2 (chain N)
        1008C   NADH:ubiquinone oxidoreductase subunit 4 (chain M)
        649C    NADH:ubiquinone oxidoreductase 49 kD subunit 7
        713C    NADH:ubiquinone oxidoreductase subunit 11 or 4L (chain K)
        838C    NADH:ubiquinone oxidoreductase subunit 3 (chain A)
        839C    NADH:ubiquinone oxidoreductase subunit 6 (chain J)
        377C    NADH:ubiquinone oxidoreductase 20 kD subunit and related Fe-S oxidoreductases

12      1346M   Putative effector of murein hydrolase
        1380R   Putative effector of murein hydrolase LrgA

13      1788I   Acyl CoA:acetate/3-ketoacid CoA transferase, alpha subunit
        2057I   Acyl CoA:acetate/3-ketoacid CoA transferase, beta subunit

14      1271C   Cytochrome bd-type quinol oxidase, subunit 1
        1294C   Cytochrome bd-type quinol oxidase, subunit 2

15      163H    3-polyprenyl-4-hydroxybenzoate decarboxylase
        43H     3-polyprenyl-4-hydroxybenzoate decarboxylase and related decarboxylases

16      1732M   Periplasmic glycine betaine/choline-binding (lipo)protein of an ABC-type transport system
        1125E   ABC-type proline/glycine betaine transport systems, ATPase components

17      1638G   TRAP-type C4-dicarboxylate transport system, periplasmic component
        1593G   TRAP-type C4-dicarboxylate transport system, large permease component

18      1203R   Predicted helicases
        1518L   Uncharacterized protein predicted to be involved in DNA repair

19      1653G   ABC-type sugar transport system, periplasmic component
        3839G   ABC-type sugar transport systems, ATPase components
        395G    ABC-type sugar transport system, permease component
        1175G   ABC-type sugar transport systems, permease components

20      2894D   Septum formation inhibitor-activating ATPase
        850D    Septum formation inhibitor
        851D    Septum formation topological specificity factor

21      2332O   Cytochrome c-type biogenesis protein CcmE
        2386O   ABC-type transport system involved in cytochrome c biogenesis, permease component
        1138O   Cytochrome c biogenesis factor

22      1129G   ABC-type sugar transport system, ATPase component
        1172G   Ribose/xylose/arabinose/galactoside ABC-type transport systems, permease components
        1879G   ABC-type sugar transport system, periplasmic component

23      3451U   Type IV secretory pathway, VirB4 components
        3505U   Type IV secretory pathway, VirD4 components

24      1640G   4-alpha-glucanotransferase
        58G     Glucan phosphorylase
        296G    1,4-alpha-glucan branching enzyme



